Abstract

In this paper, we study the problem of sparse multiple kernel learning (MKL), where
the goal is to efficiently learn a combination of a fixed small number of kernels from a
large pool that could lead to a kernel classifier with a small prediction error. We develop
an efficient algorithm based on the greedy coordinate descent algorithm that is able to
achieve a geometric convergence rate under appropriate conditions. The convergence rate
is achieved by measuring the size of functional gradients by an empirical $\ell_2$ norm that
depends on the empirical data distribution. This is in contrast to previous algorithms that
use a functional norm to measure the size of gradients, which is independent of the
data samples. We also establish a generalization error bound of the learned sparse kernel
classifier using the technique of local Rademacher complexity.

Keywords: kernel methods, multiple kernel learning, greedy coordinate descent, generalization bound

1. Introduction 

Kernel methods have been studied extensively, thanks to their empirical success in a variety 
of applications. Examples of kernel methods include support vector machines (SVMs), ker- 
nel ridge regression, kernel clustering, kernel PCA, and many others. It is well known that 
the choice of kernel function can be crucial to the success of kernel methods. Although, in
principle, a kernel can be chosen by standard model selection methods such as cross-validation,
the high computational cost makes this unattractive. Over the past decade, significant
progress has been made to efficiently learn an appropriate kernel for a given task. 

Among the many approaches developed for kernel learning, recent studies have
focused predominantly on multiple kernel learning (MKL) algorithms. Given a collection
of kernels, the objective of MKL is to learn a combination of multiple kernel classifiers,
one for each kernel function, from the training examples that results in a small prediction
error. Many computational algorithms have been developed for multiple kernel
learning (Lanckriet et al., 2004; Argyriou et al., 2005; Bach, 2008; Argyriou et al., 2006;
Lewis et al., 2006; Micchelli and Pontil, 2005; Ong et al., 2005; Bach et al., 2004; Rakotomamonjy et al.,
2008; Sonnenburg et al., 2006; Xu et al., 2008; Suzuki and Tomioka, 2011). The analysis of
generalization error bounds for MKL has been developed in several studies (Hussain and Shawe-Taylor,
2011; Ying and Zhou, 2007; Cortes et al., 2009, 2010; Bousquet and Herrmann, 2003; Srebro and Ben-David,
2006; Ying and Campbell, 2009), aiming to bound the additional error arising from optimizing
the combination of multiple kernels. These studies have shown that MKL can be
effective even when the number of kernels to be combined is very large. For instance, the
generalization error bound for learning a combination of $m$ different kernels will only
deteriorate by a factor of $\log m$ when the sum of kernel combination weights is bounded.

Despite the encouraging results, one problem with MKL is that the resulting classifier 
can be a combination of many kernel classifiers, leading to a high computational cost in 
testing. We address this challenge by developing efficient algorithms and theories for sparse 
multiple kernel learning. The objective of sparse MKL is to learn a sparse combination of
multiple kernel classifiers involving no more than $d$ kernels, where $d \ll m$ is a predefined
constant.

We develop a simple algorithm for learning such a sparse combination of multiple kernel 
classifiers, and present the analysis bounding the generalization performance of the learned 
kernel classifier. Our algorithm is an iterative algorithm based on the greedy coordinate
descent algorithm (Shalev-Shwartz et al., 2010; Nesterov, 2010; Yun et al., 2011). To generate
a sparse MKL solution involving no more than $d$ kernels, at each iteration, our algorithm
adds to the existing pool the kernel with the largest gradient. The size of gradients is
measured by an empirical $\ell_2$ norm that depends on the training examples. Under appropriate
conditions, the proposed approach is able to achieve a geometric convergence rate. To the
best of our knowledge, this is the first algorithm for sparse MKL that achieves a geometric
convergence rate.

Although several algorithms have been developed for sparse MKL by exploring different
forms of regularization (Vishwanathan et al., 2010; Kloft et al., 2009; Orabona and Jie,
2011), none of them is able to establish a generalization error bound for an MKL solution
involving a fixed number (i.e., $d$) of kernels. We also note that our work differs from
the studies on the sparsity of MKL (Koltchinskii and Yuan, 2008, 2010), which focus on
bounding the sparsity of combination weights for kernels and do not address our problem
directly.

The work most closely related to this study is (Sindhwani and Lozano, 2011), where a group
orthogonal matching pursuit (GOMP) algorithm is applied to learn a sparse combination
of kernel classifiers with exactly $d$ kernels. Unlike previous formulations for sparse MKL
that use $\ell_1$ regularization (i.e., $\sum_{j} \|f_j\|_{\mathcal{H}_j}$), the authors propose to use $\ell_2$ regularization
(i.e., $\sum_{j} \|f_j\|^2_{\mathcal{H}_j}$) together with a sparsity constraint (i.e., an $\ell_0$ constraint) for sparse MKL.
Although they did not present a convergence analysis for the proposed algorithm except
for a sparse recovery analysis, we can apply the analysis in (Shalev-Shwartz et al., 2010)
for smooth functions to their algorithm to obtain an $O(1/d)$ convergence rate. The group
orthogonal matching pursuit algorithm is similar to the greedy coordinate descent algorithm
used in this study except that we measure the size of gradients by an empirical $\ell_2$ norm
while it is measured by a functional norm in (Sindhwani and Lozano, 2011). It is this
difference that leads to a geometric convergence rate for the proposed algorithm, which is a
significant improvement over the rate of $O(1/d)$.

Outline of contributions. The following contributions are made in this paper:

• We present a baseline algorithm, based on the greedy coordinate descent method, that
achieves an $O(1/d)$ convergence rate when using an $\ell_1$ norm functional regularizer.

• We introduce an empirical $\ell_2$ norm to measure the size of functional gradients in
the application of the greedy coordinate descent algorithm to sparse MKL, and achieve a
geometric convergence rate under appropriate conditions.

• We study the generalization performance of the proposed algorithm. Specifically,
we derive an upper bound on the generalization performance of the learned classifier
using the local Rademacher technique that has an additive term of $O(\sqrt{\ln m / N})$, which
matches the existing bounds in their dependence on $m$ (i.e., the number of kernel
functions) and $N$ (i.e., the number of training samples).

Our paper is organized as follows. In the next section we formally introduce the problem
of sparse MKL. In Section 3 we present our baseline algorithm with its convergence
analysis. Section 4 introduces the main algorithm proposed in this paper with an analysis of
its convergence rate and generalization bound. We wrap up in Section 5 with a discussion
of possible directions for future work.

2. Problem Setting: Sparse Multiple Kernel Learning (MKL) 

Let $\mathcal{D} = \{(x_i, y_i), i = 1, \ldots, N\}$ be a collection of training examples, where $x_i \in \mathcal{X}$ and
$y_i \in \{-1, +1\}$, and let $\{\kappa_j(\cdot, \cdot) : \mathcal{X} \times \mathcal{X} \to \mathbb{R},\ j \in [m]\}$ be a collection of reproducing kernels
to be combined, where $[m]$ denotes the set $\{1, \ldots, m\}$. Let $\{\mathcal{H}_j, j \in [m]\}$ be the associated
Reproducing Kernel Hilbert Spaces (RKHS). We denote by $\mathbf{y} = (y_1, \ldots, y_N)^\top$ the outputs
for all the instances in $\mathcal{D}$. For the convenience of analysis, we assume $\kappa_j(x, x) \le 1$ for
any $x \in \mathcal{X}$ and any $j \in [m]$. The goal of MKL is to learn a function $f = \sum_{j=1}^m f_j$,
where $f_j \in \mathcal{H}_j$, $j \in [m]$, that has a small generalization error. A common approach for
MKL is to learn the combination of kernel classifiers by solving the following optimization
problem (Micchelli and Pontil, 2005)

$$\min_{f \in \mathcal{H}} \; \mathcal{L}(f) = \lambda \sum_{j=1}^m \|f_j\|_{\mathcal{H}_j} + \frac{1}{N} \sum_{i=1}^N \ell(f(x_i), y_i), \qquad (1)$$

where $\mathcal{H} = \{f = \sum_{j=1}^m f_j : f_j \in \mathcal{H}_j\}$ and $\ell(z, y) = (z - y)^2/2$ is a square loss $^1$. In
this study, we assume that the number of kernels $m$ is very large (could be larger than
the number of training examples $N$), and our objective is to learn a combination of kernel
classifiers involving no more than $d$ kernels, where $d \ll m$ is a predefined constant. For the
convenience of discussion, we denote by $\mathcal{E}_N(f) = \frac{1}{N} \sum_{i=1}^N \ell(f(x_i), y_i)$ the empirical loss for
kernel classifier $f$, by $\|f\| = \sum_{j=1}^m \|f_j\|_{\mathcal{H}_j}$ the norm of a combined kernel classifier $f$, and
by $J(f) = \{j \in [m] : f_j \ne 0\}$ the subset of non-zero kernel classifiers used to construct $f$.
Finally, we denote by $f^*$ the optimal solution to (1), i.e.,

$$f^* = \arg\min_{f \in \mathcal{H}} \mathcal{L}(f). \qquad (2)$$

1. Although we restrict our discussion to square loss, it is straightforward to extend our result to the
quadratic-type loss function defined in (Koltchinskii and Yuan, 2010).

Note that according to (Micchelli and Pontil, 2005), the problem in (1) is equivalent to
the following optimization problem

$$\min_{\gamma \ge 0,\ \sum_{j} \gamma_j = 1} \; \min_{f \in \mathcal{H}_\gamma} \; \frac{\lambda'}{2} \|f\|^2_{\mathcal{H}_\gamma} + \frac{1}{2N} \sum_{i=1}^N (f(x_i) - y_i)^2, \qquad (3)$$

where $\lambda' > 0$ is an appropriately chosen parameter depending on $\lambda$ in (1), and $\mathcal{H}_\gamma$ is
an RKHS endowed with the combined kernel function $\kappa(\cdot, \cdot; \gamma) = \sum_{j=1}^m \gamma_j \kappa_j(\cdot, \cdot)$. It is not
difficult to show that $\gamma_j$ computed in (3) is proportional to $\|f_j\|_{\mathcal{H}_j}$ computed from (1). As
a result, choosing the kernel classifiers with the largest functional norm in (1) is equivalent
to choosing the kernels with the largest weights $\gamma_j$ in (3).



3. Warmup: A Greedy Coordinate Descent Algorithm for Sparse MKL 

A straightforward approach for sparse MKL is a two-stage scheme: it first learns a combination
of all $m$ kernels by solving the problem in (1) and then keeps only the $d$ most
"important" kernel classifiers $f_j$ in the combination. To select the most important kernel
classifiers, a simple approach is to choose the kernel classifiers with the largest functional
norm $\|f_j\|_{\mathcal{H}_j}$, because $\|f_j\|_{\mathcal{H}_j}$ is proportional to the combination weight $\gamma_j$ in (3). It is,
however, easy to construct a counterexample to show that the two-stage scheme fails to find
the best kernel. In particular, we will show that for two cases that have the same set of
unique kernels, the two-stage scheme chooses different kernels. In the first case, we have two
kernel functions $\kappa_1(\cdot, \cdot)$ and $\kappa_2(\cdot, \cdot)$. Using multiple kernel learning, we can learn the weights
for both kernels. Let the learned weights be 0.8 for $\kappa_1(\cdot, \cdot)$ and 0.2 for $\kappa_2(\cdot, \cdot)$. According
to the two-stage approach, we will select kernel $\kappa_1(\cdot, \cdot)$. In the second case, we have 10
identical copies of $\kappa_1(\cdot, \cdot)$ and one copy of $\kappa_2(\cdot, \cdot)$. Since both cases share the same set of unique
kernels, we expect the same kernel to be selected by the two-stage approach. However,
based on a symmetry argument, it is straightforward to show that the weight for $\kappa_2(\cdot, \cdot)$
remains unchanged while the weights for the copies of $\kappa_1(\cdot, \cdot)$ are reduced to 0.08. As a
result, the two-stage approach selects kernel $\kappa_2(\cdot, \cdot)$ in the second case, a different kernel
from the first case. Another problem with this two-stage approach is its high computational
complexity, since it requires solving an optimization problem involving all kernel functions,
even including the ones that are totally irrelevant to the target prediction task.
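The counterexample can be replayed numerically. The following is a minimal sketch that takes the weights quoted in the text as given (it does not solve the MKL problem itself); the function name `two_stage_select` and the kernel labels are hypothetical and chosen only for illustration.

```python
def two_stage_select(weights, d=1):
    """Second stage of the two-stage scheme: keep the d kernels with the largest weights."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:d]


# Case 1: two distinct kernels with the learned weights quoted in the text.
case1 = {"kappa_1": 0.8, "kappa_2": 0.2}

# Case 2: ten identical copies of kappa_1 share its weight equally
# (0.8 / 10 = 0.08 each), while the weight of kappa_2 is unchanged.
case2 = {f"kappa_1_copy{c}": 0.8 / 10 for c in range(10)}
case2["kappa_2"] = 0.2

pick1 = two_stage_select(case1)
pick2 = two_stage_select(case2)
print(pick1, pick2)
```

Although both pools contain the same two unique kernels, the selection flips from the first kernel to the second once the first one is duplicated.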

As the first step, we present a baseline algorithm that extends the greedy coordinate
descent algorithm (Shalev-Shwartz et al., 2010) to solve the $\ell_1$ regularized MKL in (1)
and achieves an $O(1/d)$ convergence rate. The basic steps are shown in Algorithm 1. At
each iteration $k$, Algorithm 1 selects the kernel with the largest gradient measured by its
functional norm, denoted by $j_k$, and expands the set of selected kernels $S_{k-1}$ to $S_k$ by
including $j_k$. It then searches for the optimal combination of kernels in the set $S_k$ that
minimizes the objective function $\mathcal{L}(f)$.

Algorithm 1 A Greedy Coordinate Descent Approach for Sparse MKL with $\ell_1$ Regularization

1: Input: $\lambda > 0$: regularization parameter, $d$: the number of selected kernels
2: Initialization: $f_j^0 = 0$, $j \in [m]$, and $S_0 = \emptyset$.
3: for $k = 1, \ldots, d$ do
4:   $j_k = \arg\max_{j \in [m]} \|\nabla_j \mathcal{E}_N(f^{k-1})\|_{\mathcal{H}_j}$
5:   Exit the loop if $\|\nabla_{j_k} \mathcal{E}_N(f^{k-1})\|_{\mathcal{H}_{j_k}} \le \lambda$
6:   $S_k = S_{k-1} \cup \{j_k\}$
7:   Update the kernel classifier by solving the following optimization problem

     $$f^k = \arg\min_{J(f) = S_k} \mathcal{L}(f) = \lambda \|f\| + \mathcal{E}_N(f) \qquad (4)$$

8: end for
9: Output $f = f^d$

Note that although the objective in (1) is non-smooth
due to the non-smooth regularization term $\sum_{j=1}^m \|f_j\|_{\mathcal{H}_j}$, we are still able to obtain
an $O(1/d)$ convergence rate as shown in Theorem 1. The key lies in step 4, where instead
of choosing the coordinate with the largest gradient with respect to the objective function
$\mathcal{L}(f)$, we choose the coordinate with the largest gradient with respect to $\mathcal{E}_N(f)$, the smooth
part in the objective function, i.e.,

$$\|\nabla_j \mathcal{E}_N(f)\|_{\mathcal{H}_j} = \left\| \frac{1}{N} \sum_{i=1}^N \ell'(f(x_i), y_i)\, \kappa_j(x_i, \cdot) \right\|_{\mathcal{H}_j}.$$

On the other hand, in step 7, we update the multiple kernel classifier by solving the $\ell_1$ regularized
MKL. It is this special design that makes it possible to achieve an $O(1/d)$ convergence
rate even for the non-smooth objective function in (1). We finally note that Algorithm 1
is similar in spirit to the GOMP based approach (Sindhwani and Lozano, 2011) and shares
the same convergence rate. The main difference is that we directly solve the $\ell_1$ regularized
MKL in (1), while in (Sindhwani and Lozano, 2011) an $\ell_2$ regularizer is used and the sparsity
is enforced through a constraint based on the $\ell_0$ norm.
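The selection and exit test of steps 4 and 5 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the square loss, so that $\ell'(z, y) = z - y$, and uses the identity that the RKHS norm of $\frac{1}{N}\sum_i r_i \kappa_j(x_i, \cdot)$ equals $\sqrt{r^\top K_j r}/N$ for the kernel matrix $K_j$. The names `functional_grad_norm` and `select_kernel` are hypothetical.

```python
import math


def functional_grad_norm(K, residual):
    """RKHS norm of (1/N) * sum_i r_i * kappa(x_i, .), i.e. sqrt(r^T K r) / N."""
    n = len(residual)
    quad = sum(residual[a] * K[a][b] * residual[b]
               for a in range(n) for b in range(n))
    return math.sqrt(max(quad, 0.0)) / n


def select_kernel(kernel_matrices, predictions, labels, lam):
    """Pick the kernel with the largest functional gradient norm (step 4).

    Returns (None, norms) when that norm is at most lam (exit test of step 5).
    """
    residual = [p - y for p, y in zip(predictions, labels)]  # l'(z, y) = z - y
    norms = [functional_grad_norm(K, residual) for K in kernel_matrices]
    j = max(range(len(norms)), key=norms.__getitem__)
    return (j if norms[j] > lam else None), norms


# Toy problem with N = 2: an identity kernel matrix and an all-ones kernel matrix.
K_id = [[1.0, 0.0], [0.0, 1.0]]
K_ones = [[1.0, 1.0], [1.0, 1.0]]
j, norms = select_kernel([K_id, K_ones], [0.0, 0.0], [1.0, -1.0], lam=0.1)
j_exit, _ = select_kernel([K_id, K_ones], [0.0, 0.0], [1.0, -1.0], lam=1.0)
print(j, norms, j_exit)
```

With the larger value of `lam` no gradient norm exceeds the threshold, so the sketch reports the exit condition instead of a kernel index.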

The following theorem shows the performance guarantee of the solution obtained by
Algorithm 1; its proof is given in Appendix A.

Theorem 1 Let $f$ be the solution output from Algorithm 1. If $f$ is obtained by exiting from
the middle of the loop, we have $\mathcal{L}(f) = \mathcal{L}(f^*)$. Otherwise, we have

$$\mathcal{E}_N(f) + \lambda \|f\| \le \mathcal{E}_N(f^*) + \lambda \|f^*\| + \frac{2}{d} \|f^*\|^2.$$

It should be emphasized that although the analysis in (Shalev-Shwartz et al., 2010)
shows that the greedy coordinate descent approaches enjoy a geometric convergence rate
when the objective function is both strongly convex and smooth in its variables, it cannot
be applied to our problem directly. This is because although the loss function $\ell(z, y)$
used in the regression is both strongly convex and smooth in the prediction $z$, it is not
strongly convex in $\{f_j\}_{j=1}^m$ because the prediction is given by $\sum_{j=1}^m f_j(x)$. In the next section,
we present another approach for sparse MKL, based on greedy coordinate descent, that is
able to achieve a geometric convergence rate under appropriate conditions.



4. A Geometrically Convergent Algorithm for Sparse MKL 

In this section, we present an algorithm for sparse MKL that can achieve a geometric 
convergence rate under appropriate conditions. 

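We first argue that selecting kernel classifiers based on their functional norm may not
necessarily be the best idea. This is because, in order to ensure that a removed kernel classifier $f_j$
has a small impact on the overall regression error, we should be mostly concerned with
$\mathrm{E}[|f_j(x)|^2]$ instead of $\|f_j\|_{\mathcal{H}_j}$. To see this, we bound $\mathrm{E}[|f(x) - f_j(x) - y|^2] - \mathrm{E}[|f(x) - y|^2]$,
which measures the impact of removing $f_j$ from $f$:

$$\mathrm{E}[|f(x) - f_j(x) - y|^2] - \mathrm{E}[|f(x) - y|^2] = \mathrm{E}[|f_j(x)|^2] - 2\,\mathrm{E}[f_j(x)(f(x) - y)] \le \mathrm{E}[|f_j(x)|^2] + 2\sqrt{\mathrm{E}[|f_j(x)|^2]}\,\sqrt{\mathrm{E}[|f(x) - y|^2]}.$$

Although $\|f_j\|_{\mathcal{H}_j} \ge \|f_j\|_\infty \ge \sqrt{\mathrm{E}[|f_j(x)|^2]}$, there could be a significant gap between $\|f_j\|_{\mathcal{H}_j}$
and $\sqrt{\mathrm{E}[|f_j(x)|^2]}$ (Smale and Zhou, 2007), making it possible for the functional norm based
criterion to remove the kernels that are important in the final prediction.

Based on the above discussion, we propose to measure the size of a kernel classifier $f_j$ by
its $\ell_2$ norm, i.e., $\sqrt{\mathrm{E}[|f_j(x)|^2]}$. Since the distribution of $x$ is unavailable, we introduce the
empirical counterpart of $\sqrt{\mathrm{E}[|f_j(x)|^2]}$, called the empirical $\ell_2$ norm and denoted by $\|f_j\|_{\ell_2(\mathcal{D})}$.
Given $f_j = \sum_{i=1}^N \alpha_{j,i}\, \kappa_j(x_i, \cdot)$, its $\ell_2(\mathcal{D})$ norm is computed as

$$\|f_j\|_{\ell_2(\mathcal{D})} = \sqrt{\frac{1}{N} \sum_{a=1}^N |f_j(x_a)|^2} = \sqrt{\frac{1}{N} \sum_{a=1}^N \left( \sum_{b=1}^N \alpha_{j,b}\, \kappa_j(x_a, x_b) \right)^2} = \frac{1}{\sqrt{N}} \|K_j \alpha_j\|_2, \qquad (5)$$

where $K_j = [\kappa_j(x_a, x_b)]_{N \times N}$ is the kernel matrix for $\kappa_j$ and $\alpha_j = (\alpha_{j,1}, \ldots, \alpha_{j,N})^\top$. For
the purpose of our analysis, we also define an empirical $\ell_2$ norm for the combined classifier
$f = \sum_{j=1}^m f_j = \sum_{j=1}^m \sum_{i=1}^N \alpha_{j,i}\, \kappa_j(x_i, \cdot)$ as

$$\|f\|_{\ell_2(\mathcal{D})} = \sqrt{\frac{1}{N} \sum_{i=1}^N |f(x_i)|^2}. \qquad (6)$$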
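To make the gap between the two norms concrete, the following minimal sketch evaluates the empirical $\ell_2$ norm of a function $\sum_i \alpha_i \kappa(x_i, \cdot)$ on the training points, $\|K\alpha\|_2/\sqrt{N}$, and compares it with the RKHS norm $\sqrt{\alpha^\top K \alpha}$. The Gaussian kernel, the three sample points, and the coefficients are assumptions picked for illustration only.

```python
import math


def empirical_l2_norm(K, alpha):
    """||f||_{l2(D)} = ||K alpha||_2 / sqrt(N) for f = sum_i alpha_i * kappa(x_i, .)."""
    n = len(alpha)
    values = [sum(K[a][b] * alpha[b] for b in range(n)) for a in range(n)]  # f(x_a)
    return math.sqrt(sum(v * v for v in values) / n)


def rkhs_norm(K, alpha):
    """||f||_H = sqrt(alpha^T K alpha)."""
    n = len(alpha)
    quad = sum(alpha[a] * K[a][b] * alpha[b] for a in range(n) for b in range(n))
    return math.sqrt(max(quad, 0.0))


# Gaussian kernel exp(-(x - x')^2) on three 1-D points; kappa(x, x) = 1.
xs = [0.0, 0.1, 3.0]
K = [[math.exp(-(u - v) ** 2) for v in xs] for u in xs]

# f is the difference of two kernel bumps centred at nearby points.
alpha = [1.0, -1.0, 0.0]

l2 = empirical_l2_norm(K, alpha)
h = rkhs_norm(K, alpha)
print(l2, h)
```

In this toy case the empirical $\ell_2$ norm is more than an order of magnitude smaller than the functional norm, so a criterion based on the functional norm would rate this component far higher than its effect on the training predictions suggests.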



One way to exploit the empirical $\ell_2$ norm for sparse MKL is to incorporate it into (1)
as part of the regularization, leading to a mixture regularizer that consists of both
$\|f_j\|_{\mathcal{H}_j}$ and $\|f_j\|_{\ell_2(\mathcal{D})}$. A similar formulation is suggested in (Koltchinskii and Yuan, 2010).
It is, however, unclear how to efficiently solve the related optimization problem to achieve
a convergence rate better than $O(1/d)$. Instead, we will use the empirical $\ell_2(\mathcal{D})$ norm to
measure the size of gradients when performing greedy coordinate descent optimization. Our
analysis in subsection 4.1 shows that this modification to Algorithm 1, together with other
changes, will result in a geometric convergence rate under appropriate conditions, i.e.,

$$\mathcal{E}_N(f) - \min_{f' \in \mathcal{H}} \mathcal{E}_N(f') \le O\left(\max(0, (1 - \tau)^d)\right),$$

where the value of $\tau$ will be determined by analysis.

Algorithm 2 An $\ell_2(\mathcal{D})$ Norm based Greedy Coordinate Descent Approach for Sparse MKL

1: Input: $\lambda > 0$: regularization parameter, $d$: the number of selected kernels
2: Initialization: $f_j^0 = 0$, $j \in [m]$, and $S_0 = \emptyset$.
3: for $k = 1, \ldots, d$ do
4:   $j_k = \arg\max_{j \in [m]} \|\nabla_j \mathcal{E}_N(f^{k-1})\|_{\ell_2(\mathcal{D})}$
5:   Update the kernel classifier as

     $$f^k = f^{k-1} - f_{j_k}, \quad \text{where } f_{j_k} = \frac{1}{N} \sum_{i=1}^N a_i^k\, \kappa_{j_k}(x_i, \cdot), \qquad (7)$$

     where $a^k = (a_1^k, \ldots, a_N^k)^\top$ is the projection of $\ell'(f^{k-1}) = (\ell'(f^{k-1}(x_1), y_1), \ldots, \ell'(f^{k-1}(x_N), y_N))^\top$ into the
     space spanned by the column vectors of the kernel matrix $K_{j_k} = [\kappa_{j_k}(x_a, x_b)]_{N \times N}$.
6: end for
7: Output $f = f^d$

Algorithm 2 gives the basic steps of the new approach for sparse MKL. Similar to Algorithm 1,
at each iteration, Algorithm 2 chooses the kernel with the largest gradient and
updates the kernel classifier based on the gradient with respect to the selected kernel. The
key difference between these two algorithms is how to measure the size of the gradients. In
Algorithm 1, the size of the gradient $\nabla_j \mathcal{E}_N(f^k)$ is measured by its functional norm, while Algorithm 2
measures the size of the gradient by the $\ell_2(\mathcal{D})$ norm of $\nabla_j \mathcal{E}_N(f^k)$. In addition, Algorithm 2
follows the idea of gradient descent for updating the kernel classifier and does not require
solving any optimization problem. However, unlike the standard gradient descent algorithm
that updates the classifier directly using the gradient, Algorithm 2 projects the coefficients
of $\nabla_{j_k} \mathcal{E}_N(f^{k-1})$ into the subspace spanned by the column vectors in $K_{j_k}$ before using them for
updating. This step is critical for the correctness of the algorithm.
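The steps of Algorithm 2 can be sketched in a few lines of dependency-free Python. This is a hedged illustration, not the authors' code: it assumes the square loss (so $\ell'(z, y) = z - y$), tracks only the predictions on the training points rather than the full coefficient vectors, and computes the projection with a plain Gram-Schmidt pass, which is adequate only for small $N$. All function names and the toy kernel matrices are hypothetical.

```python
import math


def matvec(K, v):
    return [sum(k * x for k, x in zip(row, v)) for row in K]


def project_onto_columns(K, v, tol=1e-10):
    """Orthogonal projection of v onto the span of the columns of K (Gram-Schmidt)."""
    n = len(v)
    basis = []
    for c in range(n):
        col = [K[r][c] for r in range(n)]
        for q in basis:
            dot = sum(x * y for x, y in zip(col, q))
            col = [x - dot * y for x, y in zip(col, q)]
        norm = math.sqrt(sum(x * x for x in col))
        if norm > tol:
            basis.append([x / norm for x in col])
    proj = [0.0] * n
    for q in basis:
        dot = sum(x * y for x, y in zip(v, q))
        proj = [p + dot * y for p, y in zip(proj, q)]
    return proj


def empirical_loss(pred, y):
    return sum((p - t) ** 2 for p, t in zip(pred, y)) / (2 * len(y))


def greedy_l2_mkl(kernel_matrices, y, d):
    """Sketch of Algorithm 2 on the training points; returns picks and losses."""
    n = len(y)
    pred = [0.0] * n  # f^0 = 0
    selected, losses = [], [empirical_loss(pred, y)]
    for _ in range(d):
        residual = [p - t for p, t in zip(pred, y)]  # l'(f(x_i), y_i)
        # Gradient w.r.t. kernel j evaluated on the training points: K_j r / N.
        grads = [[g / n for g in matvec(K, residual)] for K in kernel_matrices]
        sizes = [math.sqrt(sum(g * g for g in gv) / n) for gv in grads]  # l2(D) norms
        jk = max(range(len(sizes)), key=sizes.__getitem__)
        a = project_onto_columns(kernel_matrices[jk], residual)
        step = [s / n for s in matvec(kernel_matrices[jk], a)]  # f_{j_k}(x_i)
        pred = [p - s for p, s in zip(pred, step)]
        selected.append(jk)
        losses.append(empirical_loss(pred, y))
    return selected, losses


y = [1.0, 1.0, -1.0, -1.0]
K_a = [[u * v for v in y] for u in y]  # rank-one kernel aligned with the labels
K_b = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]  # identity
K_c = [[1.0] * 4 for _ in range(4)]  # constant kernel
selected, losses = greedy_l2_mkl([K_a, K_b, K_c], y, d=2)
print(selected, losses)
```

Every toy kernel matrix has unit diagonal, matching the assumption $\kappa_j(x, x) \le 1$; under that assumption the eigenvalues of $K_j/N$ are at most one, which is what keeps the unit step from increasing the empirical loss in this sketch.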

4.1. Convergence Analysis 

To analyze the performance of Algorithm 2, we assume there exists a sparse MKL solution
that achieves a small regression error. More specifically, we slightly abuse our notation by
redefining $f^*$ as the optimal kernel classifier that minimizes the empirical loss $\mathcal{E}_N(f)$, $\tilde{f}$ as
the optimal kernel classifier that minimizes the empirical loss using no more than $d$ kernels,
and $\varepsilon^*$ as the difference in the empirical loss between $\tilde{f}$ and $f^*$, i.e.,

$$f^* = \arg\min_{f \in \mathcal{H}} \mathcal{E}_N(f), \quad \tilde{f} = \arg\min_{f \in \mathcal{H},\ |J(f)| \le d} \mathcal{E}_N(f), \quad \varepsilon^* = \mathcal{E}_N(\tilde{f}) - \mathcal{E}_N(f^*). \qquad (8)$$

We assume $\varepsilon^*$ is small, implying that the optimal solution $f^*$ can be well approximated by
a function involving no more than $d$ kernels.

In order to state our result, we need to characterize the relationship among different
kernel matrices. In (Koltchinskii, 2011), the author defines the quantity $\beta(b, J, H)$ to capture
the geometric relationship for a set of vectors $H = (h_1, \ldots, h_m) \in \mathbb{R}^{N \times m}$, i.e.,

$$\beta(b, J, H) = \inf\left\{ \beta > 0 : \sum_{j \in J} \lambda_j^2 \le \beta^2 \left\| \sum_{j=1}^m \lambda_j h_j \right\|_2^2,\ \forall \lambda \in C(b, J) \right\},$$

where $b \ge 0$ is a nonnegative constant, $J \subseteq [m]$, and $C(b, J)$ is defined as

$$C(b, J) = \left\{ \lambda \in \mathbb{R}^m : \sum_{j \notin J} \lambda_j^2 \le b^2 \sum_{j \in J} \lambda_j^2 \right\}.$$

$C(b, J)$ defines a set of sparse vectors in which the components in $J$ dominate the other
components measured by their absolute values. When $b = 0$, vectors in $C(b, J)$ only have
non-zero elements in the set $J$, leading to the standard definition of sparse vectors. $\beta(b, J, H)$
essentially captures the linear dependence among vectors in $H$. For instance, when all $h_j$
are normalized and orthogonal to each other, we have $\beta(0, J, H) = 1$. We extend $\beta(b, J, H)$
to $\beta(d, H)$ by taking into account all the vectors with no more than $d$ non-zero elements,

$$\beta(d, H) = \inf\{\beta(0, J, H) : J \subseteq [m],\ |J| \le d\}.$$

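We now generalize the above definitions to capture the "dependence" among the kernel
matrices $\mathcal{K} = \{\hat{K}_1, \ldots, \hat{K}_m\}$, where $\hat{K}_j = K_j/N$. Since we need to deal with a sparse matrix
$A = (\alpha_1, \ldots, \alpha_m) \in \mathbb{R}^{N \times m}$, we extend the definition of $C(b, J)$ to $S(b, J, \mathcal{K})$ for sparse matrices
as follows

$$S(b, J, \mathcal{K}) = \left\{ A = (\alpha_1, \ldots, \alpha_m) \in \mathbb{R}^{N \times m} : \sum_{j \notin J} \|\alpha_j\|_2 \le b \sum_{j \in J} \|\alpha_j\|_2,\ \alpha_j \in \mathrm{span}(\hat{K}_j),\ j = 1, \ldots, m \right\}, \qquad (9)$$

where $\mathrm{span}(\hat{K}_j)$ stands for the subspace spanned by the column vectors of $\hat{K}_j$. We then
define the quantity $\gamma(b, J, \mathcal{K})$ to capture the "dependence" among matrices in $\mathcal{K}$

$$\gamma(b, J, \mathcal{K}) = \inf\left\{ \gamma > 0 : \sum_{j \in J} \|\alpha_j\|_2 \le \gamma \left\| \sum_{j=1}^m \hat{K}_j \alpha_j \right\|_2,\ \forall A \in S(b, J, \mathcal{K}) \right\}. \qquad (10)$$

We finally define $\gamma(d, \mathcal{K})$ to take into account any matrix $A$ that has no more than $d$ non-zero
column vectors

$$\gamma(d, \mathcal{K}) = \inf\{\gamma(0, J, \mathcal{K}) : J \subseteq [m],\ |J| \le d\}. \qquad (11)$$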

We note that the value of $\gamma(d, \mathcal{K})$ is closely related to the correlation between the subspaces
spanned by any two matrices in $\mathcal{K}$. For example, when the subspaces spanned by the matrices
$\hat{K}_j$ are orthogonal to each other and the minimum non-zero eigenvalues of $\hat{K}_j$, $j \in [m]$,
are larger than $\sigma_{\min} \le 1$, we have $\gamma(d, \mathcal{K}) \le \sqrt{d}/\sigma_{\min}$. More generally, let $\delta(\mathcal{K})$ denote
the correlation between the subspaces spanned by any two matrices in $\mathcal{K}$, defined as

$$\delta(\mathcal{K}) = \max_{1 \le i < j \le m} \; \max_{\alpha_i, \alpha_j} \; \frac{\alpha_i^\top \hat{K}_i \hat{K}_j \alpha_j}{\|\hat{K}_i \alpha_i\|_2 \, \|\hat{K}_j \alpha_j\|_2}.$$

The following proposition shows the relationship between $\gamma(d, \mathcal{K})$ and $\delta(\mathcal{K})$ when $\delta(\mathcal{K})$ is
small.

Proposition 2 If $\delta(\mathcal{K}) < \frac{1}{d-1}$, the following inequality holds for $\gamma(d, \mathcal{K})$ and $\delta(\mathcal{K})$,

$$\gamma(d, \mathcal{K}) \le \frac{\sqrt{d}}{\sqrt{1 - (d-1)\,\delta(\mathcal{K})}\; \sigma_{\min}},$$

where $\sigma_{\min}$ is a lower bound of the minimum non-zero eigenvalues of $\hat{K}_j$, $j \in [m]$.
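As a quick illustration of how the bound in Proposition 2 behaves, the sketch below evaluates it for a few values of the correlation. It assumes the bound reads $\sqrt{d} / (\sqrt{1 - (d-1)\delta}\,\sigma_{\min})$ and is valid only for $\delta < 1/(d-1)$; the function name `gamma_upper_bound` is hypothetical.

```python
import math


def gamma_upper_bound(d, delta, sigma_min):
    """Evaluate sqrt(d) / (sqrt(1 - (d - 1) * delta) * sigma_min).

    Assumes the condition delta < 1 / (d - 1) of Proposition 2.
    """
    if d > 1 and delta >= 1.0 / (d - 1):
        raise ValueError("the bound requires delta < 1 / (d - 1)")
    return math.sqrt(d) / (math.sqrt(1.0 - (d - 1) * delta) * sigma_min)


orthogonal = gamma_upper_bound(4, 0.0, 0.5)   # orthogonal subspaces
correlated = gamma_upper_bound(4, 0.25, 0.5)  # correlated subspaces
print(orthogonal, correlated)
```

With zero correlation the value reduces to $\sqrt{d}/\sigma_{\min}$, the orthogonal case discussed before the proposition, and it grows as the correlation approaches $1/(d-1)$.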

Remark: The correlation between different kernels has been used in previous studies
for proving learning bounds for multiple kernel learning. For example, in (Cortes et al.,
2009), the authors derived generalization bounds for kernel ridge regression with $\ell_2$ regularization
on multiple kernels in the case where the kernels are orthogonal.

The following lemma shows that when $\gamma(2d, \mathcal{K})$ is bounded, the solution $f$ of Algorithm 2
converges to $f^*$ at a geometric rate.

Lemma 3 Let $f$ be the solution output from Algorithm 2, and $(f^*, \tilde{f}, \varepsilon^*)$ be defined in (8).
For any $\mu > 1$, we have either $\mathcal{E}_N(f) - \mathcal{E}_N(f^*) \le \mu(\mathcal{E}_N(\tilde{f}) - \mathcal{E}_N(f^*))$ or

$$\mathcal{E}_N(f) - \mathcal{E}_N(f^*) \le \frac{1}{2} \left[\max(0, 1 - \tau)\right]^d,$$

where $\tau$ is defined as

$$\tau = \frac{\cdots}{8\mu(\mu + 1)\, \gamma(2d, \mathcal{K})}.$$

The proof is deferred to Appendix B.

As indicated by Lemma 3, Algorithm 2 achieves a geometric convergence rate of $(1 - \tau)^d$,
where $\tau$ depends on the parameter $\gamma(2d, \mathcal{K})$. In particular, the smaller the $\gamma(2d, \mathcal{K})$, the
faster the convergence. One shortcoming of Lemma 3 is that it does not give an explicit
expression for bounding $\mathcal{E}_N(f) - \mathcal{E}_N(f^*)$ because the bound depends on the parameter $\mu$. The
following theorem makes the bound more explicit.

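Theorem 4 Let $f$ be the solution output from Algorithm 2, and $(f^*, \varepsilon^*)$ be defined in (8).
If the number of selected kernels $d$ is sufficiently large, i.e.,

$$d \ge 16\, \gamma(2d, \mathcal{K}) \ln\left(\frac{1}{12 \varepsilon^*}\right),$$

then we have

$$\mathcal{E}_N(f) - \mathcal{E}_N(f^*) \le 6 \varepsilon^*.$$

Proof According to Lemma 3, we have

$$\mathcal{E}_N(f) - \mathcal{E}_N(f^*) \le \min_{\mu > 1} \max\left( \mu \varepsilon^*,\ \frac{1}{2} \left[\max(0, 1 - \tau)\right]^d \right).$$

It is straightforward to show that for any $z \in [0, 1)$, if $\mu \ge (2 + z)/(1 - z)$, we have $\tau \ge z/[8\gamma]$,
where $\gamma = \gamma(2d, \mathcal{K})$. We thus have

$$\mathcal{E}_N(f) - \mathcal{E}_N(f^*) \le \min_{z \in [0,1)} \max\left( \frac{3\varepsilon^*}{1 - z},\ \frac{1}{2} \exp(-dz/[8\gamma]) \right) \le \min_{z \in [0,1)} \max\left( \frac{3\varepsilon^*}{1 - z},\ \frac{1}{2} \exp\left(2z \ln(12\varepsilon^*)\right) \right).$$

The optimum of the R.H.S. is achieved when

$$\frac{3\varepsilon^*}{1 - z} = \frac{1}{2} \exp\left(2z \ln(12\varepsilon^*)\right).$$

Under the condition given in the theorem, the above equation is satisfied with $z = 1/2$.
We also note that the solution to the above equation is unique because
$\frac{3\varepsilon^*}{1 - z} - \frac{1}{2} \exp\left(2z \ln(12\varepsilon^*)\right)$ is monotonically increasing in $z$. We complete the proof by plugging in
$z = 1/2$. ■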
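The balancing step at $z = 1/2$ is easy to check numerically. The sketch below evaluates the two competing terms, taken here to be $3\varepsilon^*/(1 - z)$ and $\frac{1}{2}\exp(2z \ln(12\varepsilon^*))$, for a few illustrative values of $\varepsilon^*$; the function name `theorem4_terms` and the chosen values are assumptions made for the example.

```python
import math


def theorem4_terms(eps_star, z):
    """Return the two terms balanced in the proof of Theorem 4 for z in [0, 1)."""
    first = 3.0 * eps_star / (1.0 - z)
    second = 0.5 * math.exp(2.0 * z * math.log(12.0 * eps_star))
    return first, second


checks = []
for eps in (1e-3, 1e-2, 5e-2):
    first, second = theorem4_terms(eps, 0.5)
    checks.append((eps, first, second))
    print(eps, first, second, 6.0 * eps)
```

For each value both terms coincide with $6\varepsilon^*$ up to floating-point rounding, which is the constant that appears in the statement of the theorem.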



4.2. Generalization Bound 

As previously mentioned, there is a rich body of literature dealing with the generalization
error bounds of MKL algorithms (Hussain and Shawe-Taylor, 2011; Ying and Zhou, 2007;
Bousquet and Herrmann, 2003; Srebro and Ben-David, 2006; Ying and Campbell, 2009). In
the remarkable work of (Lanckriet et al., 2004), a convergence rate of $O(\sqrt{m/N})$ was
proved for MKL with an $\ell_1$ constraint. This bound was later improved by utilizing the
pseudo-dimension of the given kernel class in (Srebro and Ben-David, 2006). Cortes et al. (2009)
studied the problem of multiple kernel learning with $\ell_2$ regularization for regression, and
derived learning bounds that have an additive term $O(\sqrt{m/N})$ when kernels are orthogonal.
In (Cortes et al., 2010), new generalization bounds for the family of convex combinations of
kernel functions with an $\ell_1$ constraint were presented, which have a logarithmic dependency on
the number of kernels (i.e., $\sqrt{\ln m}$).

It is worth mentioning that although the mentioned generalization bounds differ in their
dependency on the number of base kernels, all convergence rates presented are
of order $1/\sqrt{N}$ with respect to the number $N$ of samples. Recently, (Kloft and Blanchard,
2011) utilized local Rademacher complexity and derived a tighter upper bound with respect
to $N$ for $\ell_p$ norm MKL by considering the decay rate of eigenvalues of kernel matrices.
Suzuki (2011) presented a unified framework to derive the bounds of MKL with arbitrary
mixed-norm type regularization.

To present the generalization error bound for the sparse MKL solution obtained by
Algorithm 2, we introduce the following bounded RKHS $\mathcal{H}(R)$ as

$$\mathcal{H}(R) = \left\{ f = \sum_{j=1}^m f_j : f_j \in \mathcal{H}_j,\ j \in [m],\ \sum_{j=1}^m \|f_j\|_{\mathcal{H}_j} \le R \right\}.$$

The generalization error bound is stated in the following theorem.

Theorem 5 Let $f$ be the solution output from Algorithm 2, $(f^*, \varepsilon^*)$ be defined in (8), and
$f_R^*$ be the optimal function for minimizing the expected loss in $\mathcal{H}(R)$, i.e., $f_R^* = \arg\min_{f \in \mathcal{H}(R)} \mathcal{E}(f)$.
Assuming $A > 1$, $m \ge 3$, and $A \ln(m + 1) \le N \le 2^{m+1}$, we have either $\|f - f_R^*\| \le
8\max(R, \sqrt{d})/\sqrt{N}$ or, with a probability at least $1 - (m + 1)^{-A+1}$,

$$\mathcal{E}(f) - \mathcal{E}(f_R^*) \le \mathcal{E}_N(f) - \mathcal{E}_N(f_R^*) + 196 (R + \sqrt{d})^2 \sqrt{\frac{A \ln(m + 1)}{N}}.$$

Under the assumption $d \ge 16\, \gamma(2d, \mathcal{K}) \ln\left(\frac{1}{12\varepsilon^*}\right)$, we have

$$\mathcal{E}(f) - \mathcal{E}(f_R^*) \le 6\varepsilon^* + 196 (R + \sqrt{d})^2 \sqrt{\frac{A \ln(m + 1)}{N}}.$$

Remark: First, we should note that there is a tradeoff in the generalization bound with
respect to $d$, since $\varepsilon^*$ could increase when $d$ decreases. Second, the generalization bound
of the proposed algorithm for learning a combination of no more than $d$ kernels has an
additive term $O(d\sqrt{\ln m/N})$, which deteriorates by a factor of $d$ compared to previous
learning bounds of MKL. Third, if we assume $\varepsilon^*$ is small, e.g., in the order of $O(N^{-1/2})$,
and $\gamma(2d, \mathcal{K}) \le O(\sqrt{d})$, we can let $d = O(\ln^2 N)$, i.e., learning a combination of no more than
$O(\ln^2 N)$ kernels, and we have the generalization error of the proposed algorithm bounded
by $O(\ln^2 N \sqrt{\ln m/N})$, which only deteriorates by a factor of $\ln^2 N$ compared with the best
known learning bound of MKL (i.e., $O(\sqrt{\ln m/N})$).

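In order to prove Theorem 5, we need the following lemma to bound the concentration
of regression error, where $(\ell \circ f)(x, y) = \ell(f(x), y)$, and $P_N$ and $P$ are defined by

$$P_N(F) = \frac{1}{N} \sum_{i=1}^N F(x_i, y_i), \qquad P(F) = \mathrm{E}_{(x, y)}[F(x, y)],$$

for any function $F$ that takes $(x, y)$ as input.

Lemma 6 Define $r_0 = 8R/\sqrt{N}$ and $L = R + 1$. Let $g \in \mathcal{H}(R)$ be a fixed function. Assume
$A > 1$ and $A \ln(m + 1) \le N \le 2^{m+1}$. With a probability at least $1 - (m + 1)^{-A+1}$, for any
$r \ge r_0$, we have

$$\sup_{f \in \mathcal{H}(R),\ \|f - g\| \le r} \left| (P - P_N)(\ell \circ f - \ell \circ g) \right| \le 88 L r \sqrt{\frac{A \ln(m + 1)}{N}}.$$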
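The proof of Lemma 6 is provided in Appendix D. We are now ready to prove Theorem 5.
Proof [of Theorem 5] First, we show that the solution $f$ obtained by Algorithm 2 has a
bounded functional norm $\|f\|$. We have

$$\|f\| \le \sum_{k=1}^d \|f_{j_k}\|_{\mathcal{H}_{j_k}}.$$

Following inequality (14) in the proof of Lemma 3, we have

$$\|f_{j_k}\|_{\mathcal{H}_{j_k}} \le \|\nabla_{j_k} \mathcal{E}_N(f^{k-1})\|_{\mathcal{H}_{j_k}}.$$

According to the inequality in (13) in the proof of Lemma 3, we have

$$\|f_{j_k}\|^2_{\mathcal{H}_{j_k}} \le 2\left(\mathcal{E}_N(f^{k-1}) - \mathcal{E}_N(f^k)\right),$$

due to $\|\nabla_j \mathcal{E}_N(f)\|_{\ell_2(\mathcal{D})} \le \|\nabla_j \mathcal{E}_N(f)\|_{\mathcal{H}_j}$. Hence

$$\|f\| \le \sum_{k=1}^d \|f_{j_k}\|_{\mathcal{H}_{j_k}} \le \sqrt{d \sum_{k=1}^d \|f_{j_k}\|^2_{\mathcal{H}_{j_k}}} \le \sqrt{2 d\, \mathcal{E}_N(f^0)} \le \sqrt{\frac{d}{N}}\, \|\mathbf{y}\|_2 \le \sqrt{d}.$$

Second, we have

$$\begin{aligned}
\mathcal{E}(f) &\le \mathcal{E}(f_R^*) + \mathcal{E}_N(f) - \mathcal{E}_N(f_R^*) + \mathcal{E}(f) - \mathcal{E}_N(f) + \mathcal{E}_N(f_R^*) - \mathcal{E}(f_R^*) \\
&\le \mathcal{E}(f_R^*) + \mathcal{E}_N(f) - \mathcal{E}_N(f_R^*) + \left| (P - P_N)(\ell \circ f - \ell \circ f_R^*) \right| \\
&\le \mathcal{E}(f_R^*) + \mathcal{E}_N(f) - \mathcal{E}_N(f_R^*) + \sup_{f' \in \mathcal{H}(\sqrt{d})} \left| (P - P_N)(\ell \circ f' - \ell \circ f_R^*) \right|.
\end{aligned}$$

Using Lemma 6, we have either $\|f - f_R^*\| \le 8\max(R, \sqrt{d})/\sqrt{N}$, or with a probability
at least $1 - (m + 1)^{-A+1}$, that

$$\sup_{f' \in \mathcal{H}(\sqrt{d})} \left| (P - P_N)(\ell \circ f' - \ell \circ f_R^*) \right| \le \sup_{\|f' - g\| \le R + \sqrt{d}} \left| (P - P_N)(\ell \circ f' - \ell \circ g) \right| \le 88 \left(\max(R, \sqrt{d}) + 1\right)(R + \sqrt{d}) \sqrt{\frac{A \ln(m + 1)}{N}},$$

leading to

$$\mathcal{E}(f) \le \mathcal{E}(f_R^*) + \mathcal{E}_N(f) - \mathcal{E}_N(f_R^*) + 196 (R + \sqrt{d})^2 \sqrt{\frac{A \ln(m + 1)}{N}}.$$

We complete the proof by plugging in the result from Theorem 4.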



5. Conclusion 

In this paper, we developed an efficient algorithm for sparse multiple kernel learning (MKL)
based on the greedy coordinate descent algorithm. By using an empirical $\ell_2$ norm for measuring
the size of functional gradients, we are able to achieve a geometric convergence rate under
certain conditions. We also prove a generalization error bound for the proposed algorithm.
As future work, we plan to provide a better quantification of the independence among
kernel matrices, a key condition for our algorithm to achieve geometric convergence.
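Appendix A. Proof of Theorem 1

First, we bound the difference between $\mathcal{L}(f^k)$ and $\mathcal{L}(f^*)$ and show that for $k \ge 1$, the
following holds

$$\mathcal{L}(f^{k+1}) - \mathcal{L}(f^*) \le \frac{2\|f^*\|^2}{k+1}. \qquad (12)$$

Similar to the standard theory of greedy algorithms (Shalev-Shwartz et al., 2010), we
have

$$\mathcal{L}(f^k) - \mathcal{L}(f^*) \le \sum_{j=1}^m \left\langle f_j^k - f_j^*,\ \nabla_j \mathcal{E}_N(f^k) + \lambda \partial_j \right\rangle_{\mathcal{H}_j},$$

where $\partial_j \in \partial \|f_j^k\|_{\mathcal{H}_j}$. Since $f^k$ is the optimal solution of $\mathcal{E}_N(f) + \lambda \|f\|$ on the support
$J(f^k)$, we have $\nabla_j \mathcal{E}_N(f^k) + \lambda \partial_j = 0$, $\forall j \in J(f^k)$. By choosing, for $j \notin J(f^k)$,

$$\partial_j = -\frac{\nabla_j \mathcal{E}_N(f^k)}{\max\{\lambda, \|\nabla_j \mathcal{E}_N(f^k)\|_{\mathcal{H}_j}\}},$$

we have

$$\mathcal{L}(f^k) - \mathcal{L}(f^*) \le \sum_{j \notin J(f^k)} \left\langle -f_j^*,\ \nabla_j \mathcal{E}_N(f^k) + \lambda \partial_j \right\rangle_{\mathcal{H}_j} \le \|f^*\| \left[ \max_{j \in [m]} \|\nabla_j \mathcal{E}_N(f^k)\|_{\mathcal{H}_j} - \lambda \right]_+,$$

where $[z]_+ = \max(0, z)$. The above inequality indicates that if $\max_{j \in [m]} \|\nabla_j \mathcal{E}_N(f^k)\|_{\mathcal{H}_j} \le \lambda$,
$f^k$ is the optimal solution, and we thus exit the loop.

In the following, we assume $\max_{j \in [m]} \|\nabla_j \mathcal{E}_N(f^k)\|_{\mathcal{H}_j} > \lambda$. We have

$$\begin{aligned}
\mathcal{L}(f^{k+1}) &= \min_{J(f) = S_{k+1}} \mathcal{E}_N(f) + \lambda \|f\| \\
&\le \min_{J(f) = S_{k+1}} \mathcal{E}_N(f^k) + \lambda \|f\| + \sum_{j=1}^m \left\langle f_j - f_j^k,\ \nabla_j \mathcal{E}_N(f^k) \right\rangle_{\mathcal{H}_j} + \frac{1}{2N} \sum_{i=1}^N \left( f(x_i) - f^k(x_i) \right)^2,
\end{aligned}$$

where the inequality follows from the definition of $\mathcal{E}_N(f)$. To bound the R.H.S., we consider the
following construction of $f$

$$f_j = f_j^k \ \text{for } j \ne j_{k+1}, \qquad f_{j_{k+1}} = -\eta\, g_{j_{k+1}}, \qquad g_{j_{k+1}} = \frac{\nabla_{j_{k+1}} \mathcal{E}_N(f^k)}{\|\nabla_{j_{k+1}} \mathcal{E}_N(f^k)\|_{\mathcal{H}_{j_{k+1}}}}, \qquad \eta > 0.$$

Using the above solution $f$, we have

$$\mathcal{L}(f^{k+1}) \le \mathcal{L}(f^k) - \eta \left( \|\nabla_{j_{k+1}} \mathcal{E}_N(f^k)\|_{\mathcal{H}_{j_{k+1}}} - \lambda \right) + \frac{\eta^2}{2N} \sum_{i=1}^N |g_{j_{k+1}}(x_i)|^2.$$

Since the above inequality holds for any $\eta > 0$ and $j_{k+1} = \arg\max_j \|\nabla_j \mathcal{E}_N(f^k)\|_{\mathcal{H}_j}$, we have

$$\begin{aligned}
\mathcal{L}(f^{k+1}) &\le \mathcal{L}(f^k) + \min_{\eta} \left\{ -\eta \left( \max_j \|\nabla_j \mathcal{E}_N(f^k)\|_{\mathcal{H}_j} - \lambda \right) + \frac{\eta^2}{2N} \sum_{i=1}^N |g_{j_{k+1}}(x_i)|^2 \right\} \\
&\le \mathcal{L}(f^k) + \min_{\eta} \left\{ -\eta \left( \max_j \|\nabla_j \mathcal{E}_N(f^k)\|_{\mathcal{H}_j} - \lambda \right) + \frac{\eta^2}{2} \right\} \\
&= \mathcal{L}(f^k) - \frac{1}{2} \left( \max_j \|\nabla_j \mathcal{E}_N(f^k)\|_{\mathcal{H}_j} - \lambda \right)^2,
\end{aligned}$$

where the second step follows from $\|g_{j_{k+1}}\|_{\mathcal{H}_{j_{k+1}}} \le 1$ and therefore $|g_{j_{k+1}}(x_i)| \le 1$ since $\kappa_j(x_i, x_i) \le
1$. As a result, when $\max_{j \in [m]} \|\nabla_j \mathcal{E}_N(f^k)\|_{\mathcal{H}_j} - \lambda > 0$, we have

$$\mathcal{L}(f^k) - \mathcal{L}(f^{k+1}) \ge \frac{\left( \mathcal{L}(f^k) - \mathcal{L}(f^*) \right)^2}{2\|f^*\|^2}.$$

Define $\epsilon_k = \mathcal{L}(f^k) - \mathcal{L}(f^*)$. We have

$$\frac{1}{\epsilon_{k+1}} - \frac{1}{\epsilon_k} \ge \frac{\mathcal{L}(f^k) - \mathcal{L}(f^{k+1})}{\epsilon_k^2} \ge \frac{1}{2\|f^*\|^2},$$

leading to the result in (12).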
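The step from the per-iteration inequality to the $1/k$ rate is a short telescoping argument. The following worked derivation is a sketch under the assumptions that $\epsilon_t = \mathcal{L}(f^t) - \mathcal{L}(f^*) > 0$ and that $\frac{1}{\epsilon_{t+1}} - \frac{1}{\epsilon_t} \ge \frac{1}{2\|f^*\|^2}$ holds for every $t = 0, \ldots, k-1$ (that is, the loop has not exited early).

```latex
\begin{align*}
\frac{1}{\epsilon_k} - \frac{1}{\epsilon_0}
  &= \sum_{t=0}^{k-1} \left( \frac{1}{\epsilon_{t+1}} - \frac{1}{\epsilon_t} \right)
   \ge \frac{k}{2\|f^*\|^2}, \\
\frac{1}{\epsilon_k}
  &\ge \frac{1}{\epsilon_0} + \frac{k}{2\|f^*\|^2}
   \ge \frac{k}{2\|f^*\|^2}
  \quad\Longrightarrow\quad
  \epsilon_k \le \frac{2\|f^*\|^2}{k}.
\end{align*}
```

Taking $k = d$ gives the additive term $\frac{2}{d}\|f^*\|^2$ that appears in Theorem 1.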

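Next, we consider two cases. In the first case, if $f$ is obtained by exiting in the middle of the loop,
we have $\max_{j \in [m]} \|\nabla_j \mathcal{E}_N(f)\|_{\mathcal{H}_j} \le \lambda$, and therefore $\mathcal{L}(f) = \mathcal{L}(f^*)$. If $f$ is obtained by
finishing all the iterations, using (12), we have the desired rate

$$\mathcal{L}(f) - \mathcal{L}(f^*) \le \frac{2\|f^*\|^2}{d}.$$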
Appendix B. [Proof of Lemma 3] 

Similar to the proof of Theorem 1, we have 



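According to the representer theorem, we have 

$$f_j^k(x) = \sum_{i=1}^N \alpha_{j,i}^k K_j(x_i, x), \qquad f_j(x) = \sum_{i=1}^N \alpha_{j,i} K_j(x_i, x),$$

where $\alpha_j^k = (\alpha_{j,1}^k, \ldots, \alpha_{j,N}^k) \in \mathbb{R}^N$ and $\alpha_j = (\alpha_{j,1}, \ldots, \alpha_{j,N}) \in \mathbb{R}^N$ are the vector representations 
of the functions $f_j^k$ and $f_j$. Due to the projection step in updating the kernel classifier (step 5 
in Algorithm 2), we have $\alpha_j^k \in \mathrm{span}(K_j)$. It is also safe to assume $\alpha_j \in \mathrm{span}(K_j)$ because 
otherwise we can always project $\alpha_j$ into the subspace $\mathrm{span}(K_j)$ without changing the values 
$f_j(x_i), i \in [N]$, and therefore without changing $\mathcal{L}_N(f)$. We define a norm $\|\cdot\|_\alpha$ as 

$$\|f_j^k\|_\alpha = \sqrt{N} \|\alpha_j^k\|_2, \qquad \|f_j\|_\alpha = \sqrt{N} \|\alpha_j\|_2.$$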
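
The claim that projecting a coefficient vector onto $\mathrm{span}(K_j)$ leaves the values $f_j(x_i)$ unchanged can be illustrated with a small numerical sketch. Everything below is an assumption made for illustration: the toy data, the rank-deficient linear kernel, and the helper names `project_onto_span` and `alpha_norm` do not come from the paper.

```python
import numpy as np


def project_onto_span(K, alpha, tol=1e-10):
    """Orthogonally project alpha onto the column space of a symmetric PSD K."""
    eigvals, eigvecs = np.linalg.eigh(K)
    keep = eigvals > tol * eigvals.max()      # eigenvectors spanning range(K)
    basis = eigvecs[:, keep]
    return basis @ (basis.T @ alpha)


def alpha_norm(alpha):
    """The norm sqrt(N) * ||alpha||_2 used in the text."""
    return np.sqrt(alpha.shape[0]) * np.linalg.norm(alpha)


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                   # 6 toy points in R^2
K = X @ X.T                                   # linear kernel, rank 2 < N = 6
alpha = rng.normal(size=6)
alpha_proj = project_onto_span(K, alpha)

# f(x_i) = (K alpha)_i is the same before and after the projection.
values_unchanged = np.allclose(K @ alpha, K @ alpha_proj)
# The projection can only shrink the alpha-norm.
norm_not_larger = alpha_norm(alpha_proj) <= alpha_norm(alpha) + 1e-12
print(values_unchanged, norm_not_larger)
```

A rank-deficient kernel is used on purpose, since for a full-rank kernel matrix the projection is the identity and the check is vacuous.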
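Using these notations, we rewrite $\mathcal{L}_N(f^k) - \mathcal{L}_N(f)$ as 

$$\mathcal{L}_N(f^k) - \mathcal{L}_N(f) \leq \sum_{j=1}^m \frac{1}{N} (\alpha_j^k - \alpha_j)^\top K_j \ell'(f^k) \leq \left( \sum_{j=1}^m \|f_j^k - f_j\|_\alpha \right) \max_{1 \leq j \leq m} \|\nabla_j \mathcal{L}_N(f^k)\|_{\ell_2(\mathcal{D})},$$

where the second inequality follows from the Cauchy inequality and the definition of the $\ell_2(\mathcal{D})$ norm 
of $\nabla_j \mathcal{L}_N(f^k)$, which is given by 

$$\|\nabla_j \mathcal{L}_N(f^k)\|_{\ell_2(\mathcal{D})}^2 = \frac{1}{N} \sum_{a=1}^N \left( \frac{1}{N} \sum_{b=1}^N K_j(x_a, x_b)\, \ell'(f^k(x_b), y_b) \right)^2 = \frac{1}{N} \left\| K_j \ell'(f^k) / N \right\|_2^2,$$

where $\ell'(f^k) = (\ell'(f^k(x_1), y_1), \ldots, \ell'(f^k(x_N), y_N))$. Using the following equality 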



i=l 



where $a = (a_1, \ldots, a_N)$ is the projection of the vector $\ell'(f^k)$ into the subspace $\mathrm{span}(K_{j_{k+1}})$, 
we have 

m N 



i=l 
ek\\\2 



(13) 
where we use $f_j^{k+1} = f_j^k$ for $j \neq j_{k+1}$, 



Hi 



1 

IV 



k\\\2 



1 ^ 1 1 

i=l 

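and the fact $\|\nabla_j \mathcal{L}_N(f^k)\|_{\ell_2(\mathcal{D})} \leq \|\nabla_j \mathcal{L}_N(f^k)\|_{\mathcal{H}_j}$. As a result, we have 

$$\mathcal{L}_N(f^k) - \mathcal{L}_N(f^{k+1}) \geq \frac{\left[ \mathcal{L}_N(f^k) - \mathcal{L}_N(f) \right]^2}{2 \left( \sum_{j=1}^m \|f_j^k - f_j\|_\alpha \right)^2}. \qquad (14)$$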
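
The fact used above, that the empirical $\ell_2$ norm of a functional gradient is no larger than its RKHS norm when $K_j(x, x) \leq 1$, can be checked on toy data. The sketch below assumes a gradient of the form $\frac{1}{N} \sum_b \ell'_b K(x_b, \cdot)$ and uses a Gaussian kernel with random derivative values; the data, the kernel choice, and the function name `gradient_norms` are illustrative assumptions rather than part of the paper.

```python
import numpy as np


def gradient_norms(K, loss_grad):
    """Squared empirical l2 norm and squared RKHS norm of the function
    (1/N) * sum_b loss_grad[b] * K(x_b, .)."""
    n = K.shape[0]
    values = K @ loss_grad / n                     # gradient evaluated at x_1..x_N
    l2_sq = np.mean(values ** 2)                   # (1/N) * sum_a values[a]^2
    rkhs_sq = loss_grad @ K @ loss_grad / n ** 2   # <g, g>_H via the kernel trick
    return l2_sq, rkhs_sq


rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)                        # Gaussian kernel, K(x, x) = 1
loss_grad = rng.normal(size=20)                    # stand-in for l'(f^k(x_b), y_b)

l2_sq, rkhs_sq = gradient_norms(K, loss_grad)
print(l2_sq <= rkhs_sq)
```

The inequality follows because the largest eigenvalue of $K$ is at most its trace, which is at most $N$ under the bounded-diagonal assumption.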



Define $\delta_j = \alpha_j^k - \alpha_j$, $j \in [m]$. Since $\alpha_j^k \in \mathrm{span}(K_j)$ and $\alpha_j \in \mathrm{span}(K_j)$, we have $\delta_j \in \mathrm{span}(K_j)$. 
Since we assume $f$ is a combination of no more than $d$ kernel classifiers, there 
are at most $2d$ non-zero vectors in the set $\{\delta_1, \ldots, \delta_m\}$. Using the definition of $\gamma(d, \mathcal{K})$, we 
have 



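$$\sum_{j=1}^m \|f_j^k - f_j\|_\alpha = \sqrt{N} \sum_{j=1}^m \|\delta_j\|_2 \leq \sqrt{\gamma(2d, \mathcal{K})}\, \|f^k - f\|_{\ell_2(\mathcal{D})}.$$

To simplify our notation, we define $\gamma = \gamma(2d, \mathcal{K})$. We have 

$$\begin{aligned}
\mathcal{L}_N(f^k) - \mathcal{L}_N(f^{k+1}) &\geq \frac{\left[ \mathcal{L}_N(f^k) - \mathcal{L}_N(f) \right]^2}{2 \left( \sum_{j=1}^m \|f_j^k - f_j\|_\alpha \right)^2} \geq \frac{\left[ \mathcal{L}_N(f^k) - \mathcal{L}_N(f) \right]^2}{2\gamma \|f^k - f\|_{\ell_2(\mathcal{D})}^2} \\
&\geq \frac{\left[ \mathcal{L}_N(f^k) - \mathcal{L}_N(f) \right]^2}{4\gamma \left( \|f^k - f^*\|_{\ell_2(\mathcal{D})}^2 + \|f - f^*\|_{\ell_2(\mathcal{D})}^2 \right)} \\
&\geq \frac{\left[ \mathcal{L}_N(f^k) - \mathcal{L}_N(f) \right]^2}{8\gamma \left( \mathcal{L}_N(f^k) - \mathcal{L}_N(f^*) + \mathcal{L}_N(f) - \mathcal{L}_N(f^*) \right)}.
\end{aligned}$$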



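The last step in the above inequality follows the fact that $f^*$ is the minimizer of the empirical 
loss $\mathcal{L}_N(f)$ and therefore 

$$\mathcal{L}_N(f) - \mathcal{L}_N(f^*) \geq \frac{1}{2N} \sum_{i=1}^N \left( f(x_i) - f^*(x_i) \right)^2 = \frac{1}{2} \|f - f^*\|_{\ell_2(\mathcal{D})}^2.$$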



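Let $k(\mu)$ be the iteration index such that for any $k \leq k(\mu)$ we have $\mathcal{L}_N(f^k) - \mathcal{L}_N(f^*) \geq 
\mu(\mathcal{L}_N(f) - \mathcal{L}_N(f^*)) = \mu \epsilon_*$, where $\mu > 1$. Then, for all $k \leq k(\mu)$, we have 

$$\mathcal{L}_N(f^k) - \mathcal{L}_N(f^{k+1}) \geq \frac{(\mu - 1)^2}{8\gamma\mu(\mu + 1)} \left( \mathcal{L}_N(f^k) - \mathcal{L}_N(f^*) \right).$$

Define $\epsilon_k = \mathcal{L}_N(f^k) - \mathcal{L}_N(f^*)$ and $\tau = \frac{(\mu - 1)^2}{8\gamma\mu(\mu + 1)}$. Then, for any $k \leq k(\mu)$, we have $\epsilon_{k+1} \leq 
\max(0, 1 - \tau)\epsilon_k$ and therefore 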

$$\epsilon_k \leq [\max(0, 1 - \tau)]^k \epsilon_0 \leq \frac{1}{2} [\max(0, 1 - \tau)]^k,$$
leading to the result in the lemma. 
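
The algebra behind the contraction factor can be verified directly. Writing $a = \mathcal{L}_N(f^k) - \mathcal{L}_N(f^*)$ and $b = \mathcal{L}_N(f) - \mathcal{L}_N(f^*)$ with $a \geq \mu b$, the per-iteration lower bound $(a - b)^2 / (8\gamma(a + b))$ should be at least $\tau a$ with $\tau = (\mu - 1)^2 / (8\gamma\mu(\mu + 1))$. The sketch below checks this on a small grid; the grid values and the function names are illustrative assumptions, not quantities from the paper.

```python
def tau(mu, gamma):
    """Contraction constant (mu - 1)^2 / (8 * gamma * mu * (mu + 1))."""
    return (mu - 1.0) ** 2 / (8.0 * gamma * mu * (mu + 1.0))


def decrease_lower_bound(a, b, gamma):
    """(a - b)^2 / (8 * gamma * (a + b)), the guaranteed one-step decrease."""
    return (a - b) ** 2 / (8.0 * gamma * (a + b))


def check_grid():
    """Check decrease_lower_bound(a, b) >= tau * a whenever 0 <= b <= a / mu."""
    for mu in (1.5, 2.0, 5.0):
        for gamma in (0.5, 1.0, 3.0):
            for a in (0.1, 1.0, 10.0):
                for scale in (0.0, 0.5, 1.0):
                    b = scale * a / mu
                    if decrease_lower_bound(a, b, gamma) < tau(mu, gamma) * a - 1e-12:
                        return False
    return True


print(check_grid())
```

The bound is tight at $b = a / \mu$, which is why the grid includes `scale = 1.0`.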
Appendix C. [Proof of Proposition 2] 



We only need to prove 7 



Vl-(d-l)<5(/C)<7+„ 



satisfies the following inequality 



To prove this, we let $z = (\|\alpha_j\|_2, j \in J)$, and proceed as follows: 



> 7'(^iin)MEii^^-iii-'^(^) E ii^^ibNib 



i6J 



.V'^min; l^||a,||2| (l-(d-l)<5(/C)). 



Plugging in the value of $\gamma$, we prove the required inequality. 



Appendix D. [Proof of Lemma 6] 

We first bound the concentration of regression error for fixed $r$. Using the Talagrand 
inequality (Koltchinskii, 2011), we have with a probability $1 - e^{-t}$ 



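$$\begin{aligned}
\sup_{\|f - g\| \leq r} |(P - P_N)(\ell \circ f - \ell \circ g)| 
&\leq 2 \left( \mathrm{E}\left[ \sup_{\|f - g\| \leq r} |(P - P_N)(\ell \circ f - \ell \circ g)| \right] + \sup_{\|f - g\| \leq r} \sqrt{P(\ell \circ f - \ell \circ g)^2} \sqrt{\frac{t}{N}} + \sup_{\|f - g\| \leq r} |\ell \circ f - \ell \circ g| \frac{t}{N} \right) \\
&\leq 2 \left( \mathrm{E}\left[ \sup_{\|f - g\| \leq r} |(P - P_N)(\ell \circ f - \ell \circ g)| \right] + Lr\sqrt{\frac{t}{N}} + \frac{Lrt}{N} \right).
\end{aligned}$$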



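We now bound the expectation $\mathrm{E}\left[ \sup_{\|f - g\| \leq r} |(P - P_N)(\ell \circ f - \ell \circ g)| \right]$. We have 

$$\mathrm{E}\left[ \sup_{\|f - g\| \leq r} |(P - P_N)(\ell \circ f - \ell \circ g)| \right] \leq 2\mathrm{E}\left[ \sup_{\|f - g\| \leq r} R_N(\ell \circ f - \ell \circ g) \right] \leq 4L\, \mathrm{E}\left[ \sup_{\|f - g\| \leq r} R_N(f - g) \right],$$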
where $R_N(h) = \frac{1}{N} \sum_{i=1}^N \sigma_i h(x_i)$ is the Rademacher complexity measure and $\sigma_i, i = 1, \ldots, N$ 
are Rademacher variables. The last inequality follows the contraction property of Rademacher 
complexity measure (Koltchinskii, 2011). To continue bounding the quantity, we first notice 
that 



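$$\sup_{\|f - g\| \leq r} R_N(f - g) \leq r \max_{1 \leq j \leq m} \sup_{\|f_j - g_j\|_{\mathcal{H}_j} \leq 1} R_N(f_j - g_j).$$

This is because 

$$\begin{aligned}
\sup_{\|f - g\| \leq r} R_N(f - g) 
&= \sup_{\sum_{j=1}^m \|f_j - g_j\|_{\mathcal{H}_j} \leq r} \frac{1}{N} \sum_{i=1}^N \sum_{j=1}^m \sigma_i \left( f_j(x_i) - g_j(x_i) \right) \\
&\leq r \sup_{f, g} \max_{1 \leq j \leq m} \frac{1}{N} \sum_{i=1}^N \sigma_i \frac{f_j(x_i) - g_j(x_i)}{\|f_j - g_j\|_{\mathcal{H}_j}} \\
&\leq r \max_{1 \leq j \leq m} \sup_{\|f_j - g_j\|_{\mathcal{H}_j} \leq 1} R_N(f_j - g_j).
\end{aligned}$$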
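
For a single kernel, the supremum of $R_N$ over the unit ball of the RKHS has the closed form $\frac{1}{N}\sqrt{\sigma^\top K \sigma}$, and by Jensen's inequality its expectation over the signs is at most $\sqrt{\mathrm{tr}(K)}/N \leq 1/\sqrt{N}$ when $K(x, x) \leq 1$. The sketch below computes the expectation exactly by enumerating all sign vectors for a small $N$. The toy data, the Gaussian kernel, and the function name `unit_ball_rademacher` are illustrative assumptions.

```python
import itertools

import numpy as np


def unit_ball_rademacher(K):
    """Exact E_sigma sup_{||h||_H <= 1} (1/N) sum_i sigma_i h(x_i).

    Uses the closed form sup = sqrt(sigma^T K sigma) / N and enumerates
    all 2^N sign vectors, so it is only practical for small N.
    """
    n = K.shape[0]
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        s = np.array(signs)
        total += np.sqrt(max(s @ K @ s, 0.0)) / n
    return total / 2 ** n


rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dists)            # Gaussian kernel, K(x, x) = 1

complexity = unit_ball_rademacher(K)
bound = 1.0 / np.sqrt(K.shape[0])      # 1 / sqrt(N)
print(complexity <= bound)
```

Enumerating the signs, rather than sampling them, avoids Monte Carlo noise in the comparison with the bound.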



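Using Theorem 5 from (Hussain and Shawe-Taylor, 2011), we have, with a probability $1 - e^{-t}$, that 

$$\begin{aligned}
\max_{1 \leq j \leq m} \sup_{\|f_j - g_j\|_{\mathcal{H}_j} \leq 1} R_N(f_j - g_j) 
&\leq \max_{1 \leq j \leq m} \mathrm{E}_\sigma\left[ \sup_{\|f_j - g_j\|_{\mathcal{H}_j} \leq 1} R_N(f_j - g_j) \right] + 4\sqrt{\frac{\ln(m + 1) + t}{2N}} \\
&\leq \frac{1}{\sqrt{N}} + 4\sqrt{\frac{\ln(m + 1) + t}{2N}},
\end{aligned}$$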

where the last step uses the fact $K_j(x, x) \leq 1$ and the result from (Bartlett et al., 2002). 
Combining the above results and setting $t = A\ln(m + 1)$, we have with a probability at 
least $1 - 2(m + 1)^{-A}$, for a fixed $r$, 

$$\begin{aligned}
\sup_{\|f - g\| \leq r} |(P - P_N)(\ell \circ f - \ell \circ g)| 
&\leq 2Lr \left( \frac{4}{\sqrt{N}} + 16\sqrt{\frac{(A + 1)\ln(m + 1)}{2N}} + \sqrt{\frac{A\ln(m + 1)}{N}} + \frac{A\ln(m + 1)}{N} \right) \\
&\leq Lr \left( 42\sqrt{\frac{A\ln(m + 1)}{N}} + 2\frac{A\ln(m + 1)}{N} \right). \qquad (15)
\end{aligned}$$



Now, we show the bound holds uniformly for all $r \in (r_0, 2R)$. Note that $r$ cannot be larger 
than $2R$ because $\sum_{i=1}^m \|f_i - g_i\|_{\mathcal{H}_i} \leq 2R$. To this end, we consider $R_j = 2^{1-j} R, j = 0, \ldots, j_0$, 
where $j_0 \leq \lceil \log_2(2R) - \log_2 r_0 \rceil \leq 0.5\log_2 N - 1$. Then, with probability $1 - \lceil \log_2 N \rceil (m + 1)^{-A}$, 
we have (15) hold for all $\{R_j\}_{j=0}^{j_0}$. Using the monotonicity with respect to $r$, for any $r \geq r_0$, 
we have 

$$\sup_{\|f - g\| \leq r} |(P - P_N)(\ell \circ f - \ell \circ g)| \leq Lr \left( 84\sqrt{\frac{A\ln(m + 1)}{N}} + 4\frac{A\ln(m + 1)}{N} \right) \leq 88Lr\sqrt{\frac{A\ln(m + 1)}{N}}.$$



We complete the proof by using the relation $\log_2 N \leq m + 1$ and $N \geq A\ln(m + 1)$. 
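
The numerical constants 42, 84 and 88 in the bounds above can be sanity-checked with a few lines of arithmetic. The sketch below drops the common factor $Lr$ and compares the expressions before and after simplification. It assumes $A \geq 1$ and $m \geq 2$ (so that $A\ln(m + 1) \geq 1$), together with $N \geq A\ln(m + 1)$; the first two conditions are assumptions of this sketch rather than statements from the paper, and the grid values are arbitrary.

```python
import math


def fixed_r_terms(A, m, N):
    """Bracketed factors in (15), before and after simplification (without L*r)."""
    x = math.log(m + 1) / N
    before = 2.0 * (4.0 / math.sqrt(N)
                    + 16.0 * math.sqrt((A + 1) * x / 2.0)
                    + math.sqrt(A * x)
                    + A * x)
    after = 42.0 * math.sqrt(A * x) + 2.0 * A * x
    return before, after


def uniform_terms(A, m, N):
    """Factors in the uniform bound, before and after simplification (without L*r)."""
    x = A * math.log(m + 1) / N
    return 84.0 * math.sqrt(x) + 4.0 * x, 88.0 * math.sqrt(x)


def check_constants():
    """Check both simplifications on a grid satisfying the stated assumptions."""
    for A in (1.0, 2.0, 5.0):
        for m in (2, 10, 1000):
            for N in (50, 1000, 100000):     # all satisfy N >= A * ln(m + 1)
                before, after = fixed_r_terms(A, m, N)
                loose, tight = uniform_terms(A, m, N)
                if before > after + 1e-12 or loose > tight + 1e-12:
                    return False
    return True


print(check_constants())
```

The factor of two between 42 and 84 reflects the doubling of $r$ when moving from the grid $\{R_j\}$ to an arbitrary radius.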